Fix Microdata parser not understanding itemscope properly#2711
Fix Microdata parser not understanding itemscope properly#2711christianlupus merged 7 commits intonextcloud:masterfrom
Conversation
Closes nextcloud#1617 Signed-off-by: Krzysztof Haładyn <krzys_h@interia.pl>
60718ca to
a937535
Compare
Test Results 12 files 588 suites 1m 54s ⏱️ Results for commit a8aa43d. ♻️ This comment has been updated with latest results. |
Signed-off-by: Christian Wolf <github@christianwolf.email>
There was a problem hiding this comment.
I added the HTML pages from the linked issue as test cases and also the test case from you. I tried to extract useful output information mainly based on the current implementation (so, this might be wrong, please check the added JSON files).
All in all, your tests made quite some tests fail. I do not see any obvious errors in the test data. Can you please elaborate what is the problem?
Adding a changelog will at least make this test succeed easily.
| */ | ||
| private function searchChildEntries(DOMNode $recipeNode, string $prop): DOMNodeList { | ||
| return $this->xpath->query(".//*[@itemprop='$prop']", $recipeNode); | ||
| return $this->xpath->query(".//*[@itemprop='$prop' and count(ancestor::*[@itemscope]) = 1]", $recipeNode); |
There was a problem hiding this comment.
I am unsure about this XPath stuff right now. Especially the hard-coded 1 makes me nervous.
Also one test case, that I can think of (and that might cause problems) but I have not found a concrete example of: Let the top-most element be a non-recipe. Like a WebPage. The WebPage is about a Recipe. That way, the Recipe is no longer the top-most element. How will ancestor::*[@itemscope] then behave? Will it detect the recipe's props or the Webpages?
Maybe you could even create a test case (similar to the PR introduction) and add similarly to the test suite?
There was a problem hiding this comment.
Most of the failing test cases fail because... they are not actually valid Microdata. If you run them through https://validator.schema.org/ you will see that it can't see any elements either, same goes for https://foolip.org/microdatajs/live/. The spec says:
The
itemtypeattribute can only be specified on elements which have anitemscopeattribute specified.
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/itemtype
But these files specify an itemtype without an itemscope. I added this requirement to make sure the lookup for the parent itemscope works correctly. We can try to relax this requirement, but if even the official validator can't see them, then I don't expect this happens often in the real world. How would you like me to proceed here, add the missing itemscope to the test cases, or make the parser more lenient?
This is the case for: caseB.html, caseC.html, caseD.html, caseE.html, caseImageAttribute.html, caseImageContent.html, caseIngredient.html and caseInstruction.html.
The only actually failing test case is caseFix1209.html, and it seems to be exactly the issue with the parent WebPage itemscope you mentioned. I'm not really good with XPath so I wrongly assumed the ancestor query would be limited to the subtree of the node specified through $recipeNode, but this it not the case. Sadly I can't find a way around this without using XPath 2.0, which is not supported by DOMXPath. The easiest way I can see to fix this is to create a child DOMDocument containing only the recipe using DOMDocument::importNode, create a separate DOMXPath instance for it, and keep the current query that checks if we are not deeper than 1 itemscope from the root. Would you like me to do that, or perhaps we should replace the XPath queries with manual iteration over the nodes in PHP?
There was a problem hiding this comment.
OK, I think, I fixed the cases you mentioned (B through E, Image*, as well as Ingredient and Instruction). I think I fixed these in the tests.
Regarding the test Fix1209, this was added for #1209 and its fix in #1220. I am a bit more concerned as this might break the existing code somewhat. For the given test case, this is maybe even a good break (as it actually separates the instructions at least partially). However, it drops the first instruction without notice.
I marked it as incomplete for the meantime. If you see any good way to fix this or to work around it, feel free to give me a hint (or better open another PR 😉 )
Regarding the new parser routines, please do not put too much effort into it. There is a plan to rework the complete parser routines (also for the JSON+LD and other parsers). So, this would be wasted time. You are welcome to help with the rework if you want to. The point is to put the parsers on a common base.
Signed-off-by: Christian Wolf <github@christianwolf.email>
Signed-off-by: Christian Wolf <github@christianwolf.email>
Signed-off-by: Christian Wolf <github@christianwolf.email>
Signed-off-by: Christian Wolf <github@christianwolf.email>
Signed-off-by: Christian Wolf <github@christianwolf.email>
4173391 to
a8aa43d
Compare
christianlupus
left a comment
There was a problem hiding this comment.
LGTM
The missing changes are longer term and this should only be a quick fix PR. So, let's do smaller steps.
Topic and Scope
This makes the simple Microdata parser properly handle nested objects. Here is a minimal repro example of the bug it fixes:
Previously, the import would crash because both "name" props would be interpreted as the name of the recipe. Now, itemprops nested inside child itemscopes are ignored.
Closes #1617
Concerns/issues
This PR should be extended with unit tests, but I don't have the entire dev environment set up to do it right now. I mostly just wanted to bodge a fix for personal use because the bug bothered me. I only tested that the three URLs mentioned in #1617 work fine now.
Formal requirements
There are some formal requirements that should be satisfied. Please mark those by checking the corresponding box.